To make a first evaluation of the given datasets, we compute some basic metrics.
For more information on the metrics, and for the extraction of the metrics for the smaller datasets, see:
`Evaluation metrics for picking an appropriate data set for our goals.ipynb`
For importing the four largest datasets into PostgreSQL and evaluating their metrics, see:
`Importing the large data sets to psql and computing their metrics.ipynb`
Finally, the computed metrics of all datasets are exported to the `metadata` directory
and imported here for visualization.
In [1]:
def percentage(some_float):
    # Format a ratio in [0, 1] as a truncated integer percentage, e.g. 0.25 -> '25%'.
    return '%i%%' % int(100 * some_float)

def metrics_comparison_matrix(reviews_df):
    # Format each row for display: the first five columns are ratios shown as
    # percentages, the sixth is a count cast to int, and the last two are kept as-is.
    # Positional access via .iloc avoids pandas' deprecated integer-label lookup.
    return reviews_df.apply(
        lambda row:
            [percentage(row.iloc[i]) for i in range(0, 5)]
            + [int(row.iloc[5]), row.iloc[6], row.iloc[7]],
        axis=1)
In [2]:
import pandas as pd

# Metric tables exported by the two notebooks referenced above.
small_data_metrics = pd.read_csv('./metadata/initial-data-evaluation-metrics.csv')
large_data_metrics = pd.read_csv('./metadata/large-datasets-evaluation-metrics.csv')
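A light optional sanity check, assuming both exports were written with the same schema: the two frames should have identical columns so that `pd.concat` stacks them cleanly below.

# Both metric files should share one column layout before concatenation.
assert list(small_data_metrics.columns) == list(large_data_metrics.columns)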
In [3]:
# Combine the small- and large-dataset metrics into one table keyed by dataset name.
metrics = metrics_comparison_matrix(
    pd.concat([small_data_metrics, large_data_metrics])
      .set_index('dataset_name'))
In [5]:
metrics.to_csv('./metadata/all-metrics-formatted.csv')
metrics
Out[5]: (the formatted metrics for all datasets, indexed by dataset name)
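For a quick visual comparison, one could also plot one of the numeric metrics per dataset. A minimal sketch, assuming matplotlib is available; the count column is selected positionally, since its actual name depends on the metadata files:

import matplotlib.pyplot as plt

raw = pd.concat([small_data_metrics, large_data_metrics]).set_index('dataset_name')
# Column 5 is the count column formatted as int above; plot it per dataset.
raw.iloc[:, 5].plot(kind='bar', logy=True)
plt.ylabel('count (6th metric column)')
plt.tight_layout()
plt.show()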